Learning from Noisy Label Distributions
In this paper, we consider a novel machine learning problem:
learning a classifier from noisy label distributions. In this problem, each
instance, represented by a feature vector, belongs to at least one group. Instead of
the true label of each instance, we observe the label distribution of the
instances associated with a group, where the label distribution is distorted by
unknown noise. Our goals are to (1) estimate the true label of each
instance, and (2) learn a classifier that predicts the true label of a new
instance. We propose a probabilistic model that considers true label
distributions of groups and parameters that represent the noise as hidden
variables. The model can be learned with a variational Bayesian method. In
numerical experiments, we show that the proposed model outperforms existing
methods in estimating the true labels of instances.
Comment: Accepted in ICANN201
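As a rough illustration of this setting (not the authors' model), the following sketch simulates groups whose true label distributions are observed only through a simple, hypothetical mixing noise:

```python
import random

def simulate_group_observations(num_groups=5, group_size=50,
                                num_classes=3, noise=0.2, seed=0):
    """Toy generative sketch of the problem setup: each group's true
    label distribution is observed only through a noisy distortion."""
    rng = random.Random(seed)
    observations = []
    for _ in range(num_groups):
        # True labels of the instances in this group.
        labels = [rng.randrange(num_classes) for _ in range(group_size)]
        true_dist = [labels.count(c) / group_size for c in range(num_classes)]
        # Distort the distribution by mixing in uniform noise
        # (one simple, hypothetical noise model).
        noisy = [(1 - noise) * p + noise / num_classes for p in true_dist]
        observations.append((true_dist, noisy))
    return observations

obs = simulate_group_observations()
```

The learning problem is then to invert this distortion and recover per-instance labels, which the paper does with hidden variables for both the true distributions and the noise parameters.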
A Real-Time Remote IDS Testbed for Connected Vehicles
Connected vehicles are becoming commonplace. A constant connection between
vehicles and a central server enables new features and services. This added
connectivity raises the likelihood of exposure to attackers and risks
unauthorized access. Possible countermeasures to this issue are intrusion
detection systems (IDS), which aim to detect these intrusions during or
after their occurrence. The problem with IDS is the large variety of possible
approaches with no sensible option for comparing them. Our contribution to this
problem comprises the conceptualization and implementation of a testbed for an
automotive real-world scenario. That amounts to a server-side IDS detecting
intrusions into vehicles remotely. To verify the validity of our approach, we
evaluate the testbed from multiple perspectives, including its fitness for
purpose and the quality of the data it generates. Our evaluation shows that the
testbed makes the effective assessment of various IDS possible. It solves
multiple problems of existing approaches, including class imbalance.
Additionally, it enables reproducibility and generating data of varying
detection difficulties. This allows for comprehensive evaluation of real-time,
remote IDS.
Comment: Peer-reviewed version accepted for publication in the proceedings of
the 34th ACM/SIGAPP Symposium On Applied Computing (SAC'19
TreeGrad: Transferring Tree Ensembles to Neural Networks
Gradient Boosted Decision Trees (GBDTs) are popular machine learning
algorithms, with dedicated implementations such as LightGBM and support in
widely used toolkits like Scikit-Learn. Many implementations can only produce
trees in an offline, greedy manner. We explore ways to convert
existing GBDT implementations to known neural network architectures with
minimal performance loss in order to allow decision splits to be updated in an
online manner and provide extensions to allow split points to be altered as a
neural architecture search problem. We provide learning bounds for our neural
network.
Comment: Technical Report on Implementation of Deep Neural Decision Forests
Algorithm. To accompany implementation here:
https://github.com/chappers/TreeGrad. Update: Please cite as: Siu, C. (2019).
"Transferring Tree Ensembles to Neural Networks". International Conference on
Neural Information Processing. Springer, 2019. arXiv admin note: text overlap
with arXiv:1909.1179
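The core idea of such a conversion can be illustrated on a single decision split: a hard threshold becomes a sigmoid gate, making the split differentiable and hence updatable online. This is a minimal sketch of the general technique, not the paper's actual architecture:

```python
import math

def stump_predict(x, feature, threshold, left_value, right_value):
    # Hard decision split, as produced by a GBDT.
    return left_value if x[feature] <= threshold else right_value

def soft_stump_predict(x, feature, threshold, left_value, right_value,
                       steepness=50.0):
    # The same split as a differentiable "neuron": a sigmoid gate over
    # (x[feature] - threshold).  As steepness grows this recovers the
    # hard split, but the threshold is now a trainable parameter.
    gate = 1.0 / (1.0 + math.exp(-steepness * (x[feature] - threshold)))
    return (1.0 - gate) * left_value + gate * right_value

x = [0.2, 0.9]
hard = stump_predict(x, feature=1, threshold=0.5, left_value=-1.0, right_value=1.0)
soft = soft_stump_predict(x, feature=1, threshold=0.5, left_value=-1.0, right_value=1.0)
```

Stacking such gates per tree level, and summing tree outputs, yields a network whose architecture mirrors the ensemble while remaining trainable by gradient descent.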
On Using Active Learning and Self-Training when Mining Performance Discussions on Stack Overflow
Abundant data is the key to successful machine learning. However, supervised
learning requires annotated data that are often hard to obtain. In a
classification task with limited resources, Active Learning (AL) promises to
guide annotators to examples that bring the most value for a classifier. AL can
be successfully combined with self-training, i.e., extending a training set
with the unlabelled examples for which a classifier is the most certain. We
report our experiences on using AL in a systematic manner to train an SVM
classifier for Stack Overflow posts discussing performance of software
components. We show that the training examples deemed as the most valuable to
the classifier are also the most difficult for humans to annotate. Despite
carefully evolved annotation criteria, we report low inter-rater agreement, but
we also propose mitigation strategies. Finally, based on one annotator's work,
we show that self-training can improve the classification accuracy. We conclude
the paper by discussing implications for future text miners aspiring to use AL
and self-training.
Comment: Preprint of paper accepted for the Proc. of the 21st International
Conference on Evaluation and Assessment in Software Engineering, 201
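The two selection rules involved (uncertainty sampling for AL, confidence-based selection for self-training) can be sketched with a toy 1-D logistic model standing in for the paper's SVM; everything below is illustrative:

```python
import math

def train_logistic(points, labels, lr=0.5, epochs=200):
    """Minimal 1-D logistic regression (a stand-in for the SVM)."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w += lr * (y - p) * x
            b += lr * (y - p)
    return w, b

def most_uncertain(points, w, b):
    """Active learning: pick the unlabelled point nearest the boundary."""
    return min(points, key=lambda x: abs(w * x + b))

def most_confident(points, w, b):
    """Self-training: pick the unlabelled point the model is surest about."""
    return max(points, key=lambda x: abs(w * x + b))

labelled = [(-2.0, 0), (-1.5, 0), (1.5, 1), (2.0, 1)]
w, b = train_logistic([x for x, _ in labelled], [y for _, y in labelled])
pool = [0.1, 3.0, -2.5]
query = most_uncertain(pool, w, b)    # near the boundary -> send to annotator
pseudo = most_confident(pool, w, b)   # far from the boundary -> pseudo-label
```

The paper's observation that AL queries are hard to annotate follows naturally: the points the model is least certain about tend to be the genuinely ambiguous ones.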
Representation learning for cross-modality classification
Differences in scanning parameters or modalities can complicate image analysis
based on supervised classification. This paper presents two representation
learning approaches, based on autoencoders, that address this problem by
learning representations that are similar across domains. Both approaches use,
next to the data representation objective, a similarity objective to minimise
the difference between representations of corresponding patches from each
domain. We evaluated the methods in transfer learning experiments on
multi-modal brain MRI data and on synthetic data. After transforming training
and test data from different modalities to the common representations learned
by our methods, we trained classifiers for each pair of modalities. We found
that adding the similarity term to the standard objective can produce
representations that are more similar and can give a higher accuracy in these
cross-modality classification experiments.
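The combined objective can be sketched as a sum of per-domain reconstruction losses plus a representation-similarity penalty; the weighting below is a hypothetical hyperparameter, not a value from the paper:

```python
def mse(a, b):
    # Mean squared error between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def combined_loss(patch_a, recon_a, patch_b, recon_b,
                  rep_a, rep_b, weight=1.0):
    """Reconstruction terms for both domains plus a similarity term that
    penalises distance between the representations of corresponding
    patches.  `weight` is a hypothetical trade-off hyperparameter."""
    reconstruction = mse(patch_a, recon_a) + mse(patch_b, recon_b)
    similarity = mse(rep_a, rep_b)
    return reconstruction + weight * similarity

loss = combined_loss([1.0, 0.0], [1.0, 0.0],   # domain A patch + reconstruction
                     [0.0, 1.0], [0.0, 0.0],   # domain B patch + reconstruction
                     [0.5, 0.5], [0.5, 0.0])   # representations of the pair
```

Setting `weight` to zero recovers two independent autoencoders; increasing it pulls the two domains toward a common representation, which is what enables the cross-modality transfer.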
ExplainIt! -- A declarative root-cause analysis engine for time series data (extended version)
We present ExplainIt!, a declarative, unsupervised root-cause analysis engine
that uses time series monitoring data from large complex systems such as data
centres. ExplainIt! empowers operators to succinctly specify a large number of
causal hypotheses to search for causes of interesting events. ExplainIt! then
ranks these hypotheses, reducing the number of causal dependencies from
hundreds of thousands to a handful for human understanding. We show how a
declarative language, such as SQL, can be effective in declaratively
enumerating hypotheses that probe the structure of an unknown probabilistic
graphical causal model of the underlying system. Our thesis is that databases
are in a unique position to enable users to rapidly explore the possible causal
mechanisms in data collected from diverse sources. We empirically demonstrate
how ExplainIt! has helped us resolve over 30 performance issues in a commercial
product since late 2014, of which we discuss a few cases in detail.
Comment: SIGMOD Industry Track 201
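One crude stand-in for such hypothesis ranking is to score each candidate cause metric by its correlation with the symptom time series; ExplainIt!'s actual ranking over a probabilistic graphical causal model is more involved, so treat this only as an illustration of narrowing many hypotheses to a few:

```python
import math

def pearson(xs, ys):
    # Pearson correlation of two equal-length series.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)

def rank_hypotheses(symptom, candidates):
    """Order candidate cause metrics by |correlation| with the symptom."""
    return sorted(candidates,
                  key=lambda name: -abs(pearson(candidates[name], symptom)))

symptom = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "disk_io":  [1.1, 2.0, 2.9, 4.2],   # tracks the symptom closely
    "cpu_temp": [3.0, 1.0, 4.0, 1.5],   # mostly unrelated
}
ranking = rank_hypotheses(symptom, candidates)
```

The metric names here are hypothetical; in the system described, the candidate set is enumerated declaratively in SQL rather than listed by hand.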
Graph-based Features for Automatic Online Abuse Detection
While online communities have become increasingly important over the years,
the moderation of user-generated content is still performed mostly manually.
Automating this task is an important step in reducing the financial cost
associated with moderation, but the majority of automated approaches strictly
based on message content are highly vulnerable to intentional obfuscation. In
this paper, we discuss methods for extracting conversational networks based on
raw multi-participant chat logs, and we study the contribution of graph
features to a classification system that aims to determine if a given message
is abusive. The conversational graph-based system yields unexpectedly high
performance, with results comparable to those previously obtained with a
content-based approach.
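Extracting a conversational network from a raw chat log can be sketched by linking each message's author to the authors of the few preceding messages; the window size here is an assumption, not the paper's extraction rule:

```python
from collections import defaultdict

def build_conversation_graph(messages, window=2):
    """Link each message's author to the authors of the previous
    `window` messages -- one simple way to derive a conversational
    network from a multi-participant chat log."""
    edges = defaultdict(int)
    for i, (author, _text) in enumerate(messages):
        for prev_author, _ in messages[max(0, i - window):i]:
            if prev_author != author:
                edges[(author, prev_author)] += 1
    return edges

def degree(edges, node):
    # A basic graph feature usable alongside content-based features.
    return sum(1 for (a, b) in edges if node in (a, b))

log = [("alice", "hi"), ("bob", "hello"), ("carol", "hey"), ("alice", "?!")]
g = build_conversation_graph(log)
```

Features of this kind are harder to game than message content, since an abuser can obfuscate words but not easily reshape the conversation structure around them.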
Dynamic Control of Explore/Exploit Trade-Off In Bayesian Optimization
Bayesian optimization offers the possibility of optimizing black-box
functions that are not accessible through traditional techniques. The success
of Bayesian optimization methods such as Expected Improvement (EI) is
significantly affected by the degree of trade-off between exploration and
exploitation. Too much exploration can lead to inefficient optimization
protocols, whilst too much exploitation leaves the protocol open to strong
initial biases, and a high chance of getting stuck in a local minimum.
Typically, a constant margin is used to control this trade-off, which results
in yet another hyper-parameter to be optimized. We propose contextual
improvement as a simple yet effective heuristic to counter this, achieving a
one-shot optimization strategy. Our proposed heuristic can be swiftly
calculated and improves both the speed and robustness of discovery of optimal
solutions. We demonstrate its effectiveness on both synthetic and real world
problems and explore the unaccounted for uncertainty in the pre-determination
of search hyperparameters controlling the explore-exploit trade-off.
Comment: Accepted for publication in the proceedings of 2018 Computing
Conferenc
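The baseline that contextual improvement replaces can be sketched as standard EI (for minimisation) with a constant exploration margin; the contextual heuristic itself, which makes this margin depend on the optimization context, is not reproduced here:

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best, margin=0.0):
    """Standard EI for minimisation at a point with posterior mean `mu`
    and standard deviation `sigma`, given incumbent value `best`.
    A larger constant `margin` pushes toward more exploration."""
    if sigma <= 0.0:
        return 0.0
    z = (best - mu - margin) / sigma
    return (best - mu - margin) * norm_cdf(z) + sigma * norm_pdf(z)

# More posterior uncertainty -> more expected improvement, all else equal.
ei_low = expected_improvement(0.0, 0.1, 0.0)
ei_high = expected_improvement(0.0, 1.0, 0.0)
```

Because the margin is a fixed hyper-parameter here, tuning it reintroduces exactly the meta-optimization problem the paper's heuristic aims to remove.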
Large-scale diversity estimation through surname origin inference
The study of surnames as both linguistic and geographical markers of the past
has proven valuable in several research fields spanning from biology and
genetics to demography and social mobility. This article builds upon the
existing literature to conceive and develop a surname origin classifier based
on a data-driven typology. This enables us to explore a methodology to describe
large-scale estimates of the relative diversity of social groups, especially
when such data are scarce. We subsequently analyze the
representativeness of surname origins for 15 socio-professional groups in
France.
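A data-driven surname-origin classifier can be sketched with character n-gram profiles per origin class; the tiny profiles below are purely illustrative, not real training data or the paper's typology:

```python
from collections import Counter

def char_ngrams(surname, n=3):
    # "^" and "$" mark word boundaries, so prefixes/suffixes get their
    # own n-grams (suffixes like "-ski" are strong origin signals).
    s = f"^{surname.lower()}$"
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def classify(surname, profiles):
    """Score each origin by overlap between the surname's character
    trigrams and a per-origin trigram profile; return the best origin."""
    grams = Counter(char_ngrams(surname))
    scores = {origin: sum((grams & Counter(profile)).values())
              for origin, profile in profiles.items()}
    return max(scores, key=scores.get)

profiles = {
    "italian": ["ini", "ni$", "^ro", "ssi", "si$"],
    "polish":  ["ski", "ki$", "wsk", "^ko", "czy"],
}
origin = classify("kowalski", profiles)
```

Aggregating such per-surname predictions over a population then yields the kind of group-level diversity estimate the article studies.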
The Poisson-Boltzmann model for implicit solvation of electrolyte solutions: Quantum chemical implementation and assessment via Sechenov coefficients.
We present the theory and implementation of a Poisson-Boltzmann implicit
solvation model for electrolyte solutions. This model can be combined with
arbitrary electronic structure methods that provide an accurate charge density
of the solute. A hierarchy of approximations for this model includes a linear
approximation for weak electrostatic potentials, the finite size of the mobile
electrolyte ions, and a Stern-layer correction. Recasting the
Poisson-Boltzmann equations into Euler-Lagrange equations then significantly
simplifies the derivation of the free energy of solvation for these
approximate models. The parameters of the model are either fit directly to
experimental observables (e.g., the finite ion size) or optimized for
agreement with experimental results. Experimental data for this optimization
are available in the form of Sechenov coefficients that describe the linear
dependence of the salting-out effect of solutes with respect to the
electrolyte concentration. In the final part, we rationalize the qualitative
disagreement of the finite-ion-size modification to the Poisson-Boltzmann
model with experimental observations by taking into account the
electrolyte-concentration dependence of the Stern layer. A route toward a
revised model that captures the experimental observations while including
finite-ion-size effects is then outlined. This implementation paves the way
for the study of electrochemical and electrocatalytic processes of molecules
and cluster models with accurate electronic structure methods.
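The Sechenov relation that anchors the parameter fit can be written down directly: log10(S0/S) = K_s * c_salt, so solubility falls exponentially with electrolyte concentration for K_s > 0. The coefficient value below is illustrative, not a fitted result from the paper:

```python
def salting_out_solubility(s0, sechenov_k, salt_conc):
    """Sechenov relation: log10(S0/S) = K_s * c_salt, where S0 is the
    solubility in pure solvent, S the solubility at electrolyte
    concentration c_salt, and K_s the Sechenov coefficient."""
    return s0 * 10.0 ** (-sechenov_k * salt_conc)

# Illustrative numbers: K_s = 0.1 L/mol at 1 mol/L salt gives roughly a
# 21% solubility drop relative to the pure solvent.
s = salting_out_solubility(1.0, 0.1, 1.0)
```

The linearity of log10(S0/S) in the salt concentration is what makes measured Sechenov coefficients a convenient one-parameter target for optimizing the implicit-solvation model.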